Joint Audio-Visual Unit Selection – the JAVUS Speech Synthesizer

Author

  • Sascha Fagel
Abstract

The author presents a speech synthesis system that selects and concatenates speech segments (units) of various sizes from an adequately prepared audio-visual speech database. The audio and video tracks of the selected segments are used together in concatenation to preserve audio-visual correlations. The input text is converted into a target phone chain, and the database is searched for appropriate segments representing sub-chains of at least two phones that can be concatenated to form the target utterance. The final segment sequence is selected from the possible segment sequences by a weighted sum of concatenation criteria for the audio and the video join. The weights of these audio and video join costs can be used to trade off fluency in the audio channel against fluency in the video channel of the synthesized speech. The output presents the input text as audio-visual speech in which the audio and video tracks are reasonably fluent, synchronous, and intelligible.
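The selection step described in the abstract can be illustrated with a minimal sketch. It is not the JAVUS implementation: the `Segment` fields, the use of Euclidean distance between boundary feature vectors as the join cost, and the choice of the lowest total cost are all assumptions made for illustration, as are the function names.

```python
from dataclasses import dataclass
from math import dist
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical database unit covering a sub-chain of at least two phones."""
    phones: Tuple[str, ...]
    audio_start: Tuple[float, ...]  # assumed audio features at the left edge
    audio_end: Tuple[float, ...]    # assumed audio features at the right edge
    video_start: Tuple[float, ...]  # assumed video features at the left edge
    video_end: Tuple[float, ...]    # assumed video features at the right edge


def sequence_cost(segments: Sequence[Segment], w_audio: float, w_video: float) -> float:
    """Weighted sum of audio and video join costs over all adjacent segment pairs."""
    pairs = list(zip(segments, segments[1:]))
    audio = sum(dist(a.audio_end, b.audio_start) for a, b in pairs)
    video = sum(dist(a.video_end, b.video_start) for a, b in pairs)
    return w_audio * audio + w_video * video


def select_sequence(candidates: Sequence[Sequence[Segment]],
                    w_audio: float = 1.0,
                    w_video: float = 1.0) -> Sequence[Segment]:
    """Pick the candidate sequence with the lowest weighted join cost (assumed criterion)."""
    return min(candidates, key=lambda seq: sequence_cost(seq, w_audio, w_video))
```

Raising `w_audio` relative to `w_video` favours sequences with smoother audio joins, and the reverse favours smoother video joins, which is the trade-off the abstract describes. Enumerating all candidate sequences is only practical for small examples; a dynamic-programming search would be the usual alternative.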

Similar resources

Study on Unit-Selection and Statistical Parametric Speech Synthesis Techniques

One interesting topic in the multimedia domain is enabling computers to produce speech. Speech synthesis gives computers the human ability to produce speech. The data-based approach and the process-based approach are the two main approaches to speech synthesis. Each approach has its own challenges. Unit-selection speech synthesis and statistical parametr...

Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer

In this paper, we present a method to take into account visual information during the selection process in an acoustic-visual synthesizer. The acoustic-visual speech synthesizer is based on the selection and concatenation of synchronous bimodal diphone units i.e., speech signal and 3D facial movements of the speaker’s face. The visual speech information is acquired using a stereovision techniqu...

Audio-Visual Unit Selection for the Synthesis of Photo-Realistic Talking-Heads

This paper investigates audio-visual unit selection for the synthesis of photo-realistic, speech-synchronized talking-head animations. These animations are synthesized from recorded video samples of a subject speaking in front of a camera, resulting in a photo-realistic appearance. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mo...

A hidden Markov model based visual speech synthesizer

This paper describes a hidden Markov model (HMM) based visual synthesizer designed to assist persons with impaired hearing. This synthesizer builds on results in the area of audio-visual speech recognition. We describe how a correlation HMM can be used to integrate independent acoustic and visual HMMs for speech-to-visual synthesis. Our results show that an HMM correlating model can significantly...

Unit Size in Unit Selection Speech Synthesis

In this paper, we address the issue of choice of unit size in unit selection speech synthesis. We discuss the development of a Hindi speech synthesizer and our experiments with different choices of units: syllable, diphone, phone and half phone. Perceptual tests conducted to evaluate the quality of the synthesizers with different unit size indicate that the syllable synthesizer performs better ...

Publication date: 2006